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The implementation of a full electronic structure calculation code on a hybrid parallel architec- 
ture with Graphic Processing Units (GPU) is presented. The code which is on the basis of our 
implementation is a GNU-GPL code based on Daubechies wavelets. It shows very good perfor- 
mances, systematic convergence properties and an excellent efliciency on parallel computers. Our 
0^ GPU-based acceleration fully preserves all these properties. In particular, the code is able to run 

C ^ ^ on many cores which may or may not have a GPU associated. It is thus able to run on parallel 

C ^ and massive parallel hybrid environment, also with a non-homogeneous ratio CPU/GPU. With 

^SJ double precision calculations, we may achieve considerable speedup, between a factor of 20 for 

, ^ some operations and a factor of 6 for the whole DFT code. 

I I I. INTRODUCTION 

^ The Kohn-Sham (KS) formalism of the Density Functional Theory (DFT) approach is a well-established first- 

' principles method for investigating properties of atomistic systems. In the recent years, the increasing of the com- 
putational power of modern supercomputers has further stimulated the interest of the community for electronic 
structure calculations of systems with many electrons. Systems which were untractable only few years ago become 
^ now accessible with the advent of modern machines. However, despite the approximate nature of the approach, the 
computational demand becomes huge already for systems with few hundreds atoms. For the most distributed DFT 
codes, the number of computational operations scales cubically with respect to the number of atoms in the system. 
^ As a result, the computational overhead for treating systems with large number of atoms now represents a serious 
""^ limitation for the maximum size of the system considered. A possible strategy to circumvent this problem can be 
^ provided by the development of linear scaling algorithms for electronic structure calculations, which have been widely 
O developed in recent years. Such kind of approaches have a crossover point with the traditional cubic codes which is 
I ^ I placed right around few hundreds atoms, and at present they represents one of the most promising strategies for ab 

initio simulation of big systems that exhibit quantum mechanical behaviour. 
T-H In the past few years, the possibility of using Graphic Processing Units (GPU) for scientific calculations has raised 

^ a lot of interest. A technology initially developed for home PC hardware has rapidly evolved in the direction of pro- 
£2 grammable parallel streaming processor. The features of these devices, in particular the very low price performance 
ratio, together with the relatively low energy consumption, make them actractive platforms for intensive scientific 
computations. A lot of scientific applications have been recently ported on GPU, including for example molecular dy- 
namics dU, quantum Monte-Carlo dH}, Finite Element Methods (7). In the domain of electronic structure calculation, 
the calculation of the exchange-correlation term in a Gaussian based DFT code has been implemented on GPU Q, 
as well as the evaluation of the Coulomb potential (j5| . Also the self consistent field calculation of the gaussian-based 
GAMESS code was ported on GPU ^ . Most of these implementation are performed on single precision calculation 
units, and with a single CPU core connected to a single GPU card. 

In this paper we will present an implementation of a full DFT code which can run on massive parallel hybrid 
CPU-GPU clusters. Our implementation is based on the achitecture of the most recent NVidia GPU cards (compute 
^ capability of type 1.3), which supports double precision floating point numbers. The routines which define the GPU 
^ kernels are thus coded within the specification of the CUDA language (|T|). The DFT code which is on the basis 
of our implementation is a systematic code based on Daubechies wavelet s([9j). The latter is a systematic, orthogonal 
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real-space basis set which presents optimal properties for expanding localised information. The properties of this basis 
will turn out to be optimal for an extension on a GPU-accelerated environment. This DFT code, named BigDFT 
([H]) , is a Free Software GNU-GPL code and is delivered either in a standalone version or integrated (in its pure CPU 
version) in the ABINIT (TU) software package. 

The paper is organised as follows: in the first section we will present the main features of the BigDFT code to 
see how the operators of the KS hamiltonian can be written in Daubcchies wavelets basis. We will then discuss 
the implementation of these operators in the GPU architectures, and inspect their performances separately from the 
complete DFT code. Next, we will inspect the problem of the transfer of the data on the card, and we will show the 
performances of the full DFT code on parallel environments. 

II. OVERVIEW OF THE BIGDFT CODE 

Let us briefly describe the main features of the DFT code on which our GPU developments are based. The 
BigDFT code (8), is a free software (GNU-GPL) code issued from a three years STREP european project. It is a 
pseudopotential DFT code, in which Kohn-Sham (KS) wavefunctions are expressed in Daubechies wavelets basis ®. 
The latter is a multi-resolution basis, with compact-support, orthogonal functions. 

The choice of the basis is of great importance in the computational operations which are performed by this code. 
To explain this fact in more detail let us illustrate the main operations which must be performed in the context of a 
DFT calculation. 

In the KS formulation of DFT, the KS wavefunctions are eigenfunctions of the KS Hamiltonian, with pseu- 
dopotential Vpsp'- 

+ VksIp] + ^psp) I*.) = ■ (1) 

The KS potential Vks [p] is a functional the electronic density of the system: 

p(r)= 5"n« |vl/,(r)|^ , (2) 

1=1 

where 77-occ is the occupation of orbital i. 

The KS potential I4cs[p] = ^ff [p] + [p] + Kxt contains the Hartree potential Vh, solution of the Poisson's equation 
V^Vf/ = — 47r/5, the exchange-correlation potential V^^c and the external ionic potential Vc-x^t acting on the electrons. 
In BigDFT code the pseudopotential term Vpsp is of the form of norm-conserving GTH-HGH pseudopotentials (flTI 
1121 IT3j) . which have a local and a nonlocal term, Vpsp — Mocai + Kioniocai- The KS hamiltonian can then be written as 
the action of three operators on the wavefunction: 

(^-^V^ + Vl + Konlocal) I*.) - £^1*^) , (3) 

where — +V^c+K!xt+^iocai is a real-space based (local) potential, and Vnoniocai comes from the pseudopotentials. 

As usual in a KS DFT calculation, the application of the hamiltonian is a part of a self consistent cycle, needed 
for minimising the total energy. In addition to the usual orthogonalization routine, in which scalar products (^'ij^'j) 
should be calculated, another operation which is performed on wavefunctions in BigDFT code is the preconditioning. 
This is calculated by solving the Helmholtz equation 

(-^V2-e,)|5,) = |ft), (4) 

where \gi) is the gradient of the total energy with respect to the wavefunction l^*;), of energy e^. The preconditioned 
gradient \gi) is found by solving Eq. Q by a preconditioned conjugate gradient method. 

A. Daubechies basis and wavelet transformation 

We describe in this section how the wavefunctions are expressed in the Daubechies basis. Though a more complete 
description can be found in ((SI, it is useful to revisit the principal points in view of a GPU implementation. 
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There are two fundamental functions in Daubechies family, the scaling function (j){x) and the wavelet ip{x). The 
full basis set can be obtained from all translations by a certain grid spacing h of the scaling and wavelet functions 
centered at the origin. Both functions are localized, with compact support. All the properties of these functions can 
be obtained from the relations 

m 

0(x) = \/2 J2 hj<t>{2x-j) (5) 

in 

^{x)^V2 J2 gj<l>{2x-j) 

j = l-m 

which relate the basis functions on a grid with spacing h and another one with spacing h/2. hj and gj = (—!)■' 
are the elements of a filter that characterizes the wavelet family, and m is the order of the family. From Eq. dsl, 
every scaling function and wavelet on a coarse grid of spacing h can be expressed as a linear combination of scaling 
functions at the finer grid level h/2. For this reason, wavelet functions complete the information which is lacking for 
refining the resolution level. In our implementation we use only two resolution levels, namely one level of adaptivity. 
We can thus classify the coarse and the fine degrees of freedom by the information expanded by the scaling and the 
wavelet functions respectively. 

For a three-dimensional description, the simplest basis set is obtained by a tensor product of one dimensional basis 
functions. For a two resolution level description, in each grid point the coarse degrees of freedom are expanded by a 
single three dimensional function (/)^^ (r) , while the fine degrees of freedom can be expressed by adding other seven 
basis functions, ^^^i.ja j3(r), which include tensor products with one-dimensional wavelet functions. 

A wavefunction ^'(r) can thus be expanded in this basis: 

E cn...3<...3W+ E E^?....3^^....3W (6) 

The sum over ii, i2, is runs over all the grid points contained in the low resolution region and the sum over ji, 
J2i J3 over all the points contained in the (generally smaller) high resolution region. Such points belong to a uniform 
mesh of grid spacing h. Each wavefunction is then associated to a set of coefficients {c^^ j^}, n = 0, ■ ■ ■ ,7. The 
wavefunctions are stored in a compressed form where only the nonzero scaling function and wavelets coefficients 
are stored. The basis set being orthogonal, several operations such as scalar products among different orbitals and 
between orbitals and the projectors of the non-local pseudopotential can directly be done in this compressed form. 

The local hamiltonian operator can be applied in the pure fine scaling function representation, (a basis set which 
contains only scaling functions centered on a finer grid of spacing h' = h/2): 

= E ■ (7) 

^1 '^3 

The transformation between a mixed coarse scaling function/ wavelet representation and a pure fine scaling function 
representation to is done by the fast wavelet transformation (I14p which is a three dimensional, separable convolution, 
that can be obtained from the filters hj and gj of Eq.([5|. 



B. The kinetic operator 

The matrix elements of the kinetic energy operator among the basis functions can be calculated analytically (fSl [T^ . 
For the pure fine scaling function representation described in Eq. ^ , the result of the application of the kinetic energy 
operator on this wavefunction, has the expansion coefficients s^'^^^i^^i^, which are related to the original coefficients 
Si' by a convolution 

1' 2' 3 ^ 

where 



2 

Ji '-^a 



and Ti are the filters of the one-dimensional second derivative in Daubechies scaling functions basis, which can be 
computed analytically. 
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C. Application of the local potential 



The potential Vl is defined in real space, in particular on the points of the finer grid of spacing h' . The application 
of the local potential in Daubechies basis consists of the basis decomposition of the function product yL(r)^(r). As 
explained in (|51 116p , the simple evaluation of this product in terms of the point values of the basis functions is not 
precise enough. A better result may be achieved by performing a transformation to the wavefunction coefficients, 
which allows to calculate the values of the wavefunctions on the fine grid, via a smoothed version of the basis functions. 
This is the so-called "magic filter" transformation, which can be expressed as follows: 

and allows to express with better accuracy the potential application. In other terms, the point values of a given 
wavefunction \^) are expressed as if ^'(r) would be the smoothest function which has the same Daubechies expansion 
coefficients of j^*). This procedure guarantees the highest precision {0{h^^) in the potential energy) and can be 
computationally expressed by a three-dimensional separable convolution in terms of the filters tOi. After application 
of the local potential (pointwise product), a transposed magic filter transformation can be applied to obtain Daubechies 
expansion coefficients of Vf^j^'). 



D. Full local hamiltonian 



The above described operations must be combined together for the application of the local hamiltonian 
(— -I- VL(r)). The order of the operation must be the following: 

1) Decompression of the data on the whole simulation grid, 

2) Wavelet transformation on the fine grid, 

3) Local potential application: 

3a) Magic filter transformation, 

3b) Potential multiplication, 

3c) Transposed magic filter transformation. 

4) Kinetic operator from the output of 2), 

5) Sum with potential application, 

6) Inverse wavelet transformation, 

7) Data compression. 

In the CPU version of BigDFT some of these operations are combined together, e.g. for the potential application 2) 
and 3a), and also 3c) and 6) are combined into a set of common filters, and the boundaries can be arranged such 
as to avoid data compression-decompression. However, in our GPU implementation we choose to separately apply 
these operations, which is simpler and accounts for better modularity. Moreover, different boundary conditions can 
be implemented immediately. 



E. Local density calculation 



The density of the electro nic sy stem is derived from the square of the point values of the wavefunctions, (see Eq. 
([2])). As described is section II. C a convenient way to express the point values of the wavefunctions is to apply the 
magic filter transformation to the wavefunctions expressed in Daubechies basis. The operations needed for calculating 
the local density would then be identical to the points 1), 2) and 3a) of the list at section II. D followed by an 
accumulation of the squares of the wavefunctions in the same array. 
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F. Other operations 

The local potential Vl can be obtained from the local density p by solving the Poisson's equation and by calculating 
the exchange-correlation potential V^c [p] ■ These operations are performed via a Poisson Solver based on interpolating 
scaling functions (jl8j) . which is a basis set tightly connected with Daubechies functions, optimal for electostatic 
problems, and which allows for mixed boundary conditions. A description of this Poisson solver can be found in the 
papers (fT^EUll . 

The complete hamiltonian contains also the nonlocal part of the pseudopotential which, thanks to the orthogo- 
nality of Daubechies wavelets, can directly be applied in the compressed form. All the other operations can also be 
performed in the compressed form. In particular, overlap matrices needed for the orthogonality constraints, and their 
manipulations, are implemented by suitable applications of BLAS -LAPACK routines. 



G. Parallelization for homogeneous computing clusters (CPU code) 

The localisation and the orthogonality of the basis functions are of key importance for improving the performances 
of the code. Indeed, thanks to their relatively short filters, the convolutions can be efhciently optimised. Two 
data distribution schemes are used for parallelising the code. In the orbital distribution scheme (used for hamiltonian 
application, preconditioning) , each processor works on one or a few orbitals for which it holds all its scaling function and 
wavelet coefficients. The other operations are performed in the coefficient distribution scheme, where each processor 
holds a certain subset of the coefficients of all the orbitals. For an homogeneous distribution between orbitals, switch 
back and forth between the orbital distribution scheme and the coefficient distribution scheme is done by the MPI 
global transposition routine MPI_ALLTOALL. Further flexibility has been added by allowing each processor to store 
variable number of orbitals and/or wavefunction coefficients. This is particularly useful for an hybrid architecture, 
when only some of the cores can be in relation with some hardware accelerator (a GPU for example) . For this variable 
repartition case, the communication are performed with suitable calls to MPI_ALLTOALLV routines. For parallel 
computers where the cross sectional bandwidth (22) scales well with the number of processors this global transposition 
does not require a lot of CPU time. The BigDFT code shows very good computational performances and an excellent 
efficiency (above 90%) on parallel computers. 



III. IMPLEMENTATION OF CONVOLUTIONS ON GPU 
A. Wavelet transformation and Magic Filters convolutions 

As discussed above, the operations we are going to implement on GPU are essentially three-dimensional convolutions 
with short, separable filters. In order to present the computational implementation, let us analyse in detail the magic 
filter transformation. A three-dimensional array s (input array) of dimension ni,n2,n3 (the dimension of the grid of 
spacing h') is transformed into the output array ^'^ given by 

u 

■^rih.h, h) = X! ^ji^j2^]3s{Ii - ji.h - j2,h - js) ■ (11) 

With a lowercase index ip we indicate the elements of the input array, while with a capital index Ip we indicate the 
indices after application of the magic filters cui, which have extension from —L to U. The filter dimension U + L + 1 
equals to the order of the Daubechies family m, which is 16 in our case. In BigDFT, different boundary conditions 



(BC) can be applied at the border of the simulation region, which affects the values of the array s in (111 when the 
indices are outside bounds. For our GPU implementation we use periodic BC in the three dimensions, i.e. data are 
wrapped periodically, tough the original CPU version of these convolutions also admits isolated and slab-like BC. 
However, our GPU implementation can be straightforwardly extended to these BC. 

The most convenient way to calculate a three dimensional convolution of this kind is by combining one dimensional 



convolutions and array transpositions, as explained in (21) In fact, the calculation (111 can be cut in three steps: 

1) ^3(4,H,«2) = Wjs(ii,i2,-?^3 - j) Vzi,i2; 

2) F2{l2,l3,H) ^J2j^jP3{l3,h,l2 - j) V/3,Zi; 

3) ^rih.h, h) - E, ^jF2{l2.h, h - j) V/2, /3 
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The final result is thus obtained by a successive application of the same operation: 

u 

F(/,a)= ^ c^,G(a,/-j) Va = 1, • • • , iV , / = 1, • • • , n . (12) 

The lowest level routine which will be ported on GPU is then a set of N separate, one dimensional (periodic), 
convolutions of arrays of size n. The number TV equals nin2, nin^ and 712^3 respectively, for each step of the three 
dimensional construction, while n equals to the dimension which is going to be transformed. The output of the first 
step is then taken as the input of the second, and so on. 



B. NVidia GPU Architecture 

A NVidia GPU is composed of a global memory and a set of multiprocessors. Each multiprocessor includes eight 
computing units (cores) and a "private" shared memory. To obtain optimal performance, it is necessary to store the 
data in the private memory of multiprocessors. The CUDA programming model (1) defines mechanisms allowing the 
data transfer in the shared memory. Fine-grained threads are the basic means of parallel execution in CUDA. Indeed, 
A CUDA program is mapped on a set ("grid") of execution blocks running on multiprocessors. A block contains 
several threads executed by the processors of a multiprocessor. The processors execute the threads in a synchronous 
way (data parallel threads). In the following section, we will show the design and implementation of the Magic Filter 
convolution described in Sec lIII.AI on NVIDIA CPUs. 



C. Implementation details 

From the GPU parallelism point of view, there is a set of N independent convolutions to be computed. Each of the 
lines of n elements to be transformed is split in chunks of size . Each multiprocessor of the graphic card computes 
a group of Ni different chunks and parallellises the calculation on its computing units. Figure [T] shows the data 
distribution on the grid of blocks during the transposition. 

In order to get best performance with the GPU, its is strongly recommended to transfer data from the global to 
the shared memory of multiprocessors. The shared memory must contain buffers to store the data needed for the 
convulation computations. The desired boundary conditions (periodic in our case) is implemented in the shared 
memory during the data transfer. Each thread computes the convolution for a subset of elements associated to 
the block. This data distribution is illustrated in Fig. [2] 



D. Kinetic convolution 

The GPU implementation described for the magic filter convolution can be applied also to the wavelet transforma- 
tion, by suitably changing the values of the filters and their extensions. This can be done thanks to the fact that also 



the latter is a three-dimensional separable convolution, i.e. it has the same formal expression as Eq.( 11 1. A little, but 
substantial, difference should be stressed for the kinetic operator application, defined in Eqs.([8]) and ([9|. In this case 
the three dimensional filter is the sum of three different filters. This implies that the kinetic filter operation must be 
cut differently from the other separable convolutions: 

1) K3{l3,ii,i2) ^J2jTjs{ii,i2,l3 - j) Vii,«2; 

3) S{h,l2,h) - Ej T,s{h ~ h h,h) + K2{l2,h, h) V/2, h. 

Also in this case, the three dimensional kinetic operator can be seen as a results of a successive application of the 
same operation: 

Kp{I,a)^Y.^oGp-i{a,I-]) + Kp^i{a,I) et Gp{I,a) ^ Gp^i{aJ) Va, / , (13) 
3 

In other terms, the one-dimensional kernel of the kinetic energy has two input arrays, Gp-i and Kp^i, and returns 
two output arrays Kp and the Gp, which is the transposition of Gp_i. At the first step (p — 1) we put Go — s and 
Kq = 0. Eventually, for p = 3, we have K3 — s and G3 — s. One can put instead Kq ^ 0, and in that way will be 



= s + Kq. This allows to unify step 4) and 5) in the list of section II. D by putting Kq equal to the output of 



step 3c). This algorithm can be also used for the Helmholtz operator of the preconditioner, by putting Kq = —eiS. 
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FIG. 1 Data distribution for ID convolution+transposition on the GPU. Input data (left panel) are ordered along the A''-axis, 
while output (right panel) is ordered in n-axis direction, see Eq. (121. When executing GPU convolution kernel, each block 
of the execution grid is associated to a set of A'^ (A^-axis) times A^e (n-axis) elements. The size of the data feed to each 
block is identical (such as to avoid block-dependent treatment), hence when A'^ and n are not multiples of A'^ and A^e, some 
data treated by different blocks may overlap. This is indicated with the filled patterns in the figure. Behind the label, in 
light gray, it is indicated the portion of data which should be copied to the shared memory for treating the data in the block. 
See Fig. [2] for a detail of that part. 



IV. PERFORMANCE EVALUATION OF GPU CONVOLUTION ROUTINES 



To evaluate the performance of ID convolution routines described in Eqs.( 13 1 and ( 12 ), together with the analogous 
operation for the wavelet transformation, we are going to compare the execution times on CPU and GPU. We define 
the GPU speedup with the ratio between CPU and GPU execution times. For these evaluations, we used a computer 
with an Intel Xeon Processor X5472 (3 GHz) and a NVidia Tesla S1070 card. The CPU version of BigDFT is 
deeply optimized with optimal loop unrolling and compiler options. The GPU code is compiled with the Intel 
Fortran Compiler (10.1.011) and the most aggressive compiler options (-02 -xT). With these options the magic filter 
convolutions run at about 3.4 GFlops. Since the GPU convolutions are called by an electronic structure code, all 
benchmarks are performed with the double precision floating point numbers proposed by the recent GPU. 

The performance graphs for the three above mentioned convolutions, together with the compression-decompression 
operator, are indicated in Fig. [3] as a function of the size of the corresponding three dimensional array. 



A. Three dimensional operators 

As described in the previous sections, to build a three-dimensional operation one must chain three times the 
corresponding one-dimensional GPU kernels. We obtain in this way the three-dimensional wavelet transformations 
as well as the kinetic operator and the magic filter transformation (direct and transposed). The multiplication with 
the potential and the calculation of the square of the wavefunction are performed via the application of some special 
GPU kernels, based on the same guidelines of the others. Also the wavefunction compression-decompression can be 
calculated in the card. The reduction operations, like for example the calculation of the potential and kinetic energy 
can be seen as linear algebra operations and can thus be performed with suitable calls to the corresponding CUBLAS 
routines. The GPU speedup of the local density construction as well as the local hamiltonian application and of the 
preconditioning operation is represented in Fig. [4] as a function of the compressed wavefunction size. 
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FIG. 2 Data arrangment in shared memory for a given block. The number of hnes Ng is chosen to be a divisor of the half-warp 
size. Data are then treated in units of warpSize/2. The thread index has (warpSize/2, Nt) elements, with Nt < 16 (left panel). 
Each group of threads (*, i) of the half-warp i treats a definite number of elements, either for copying the data (center panel) 
or for performing the calculation(right panel). This data arrangement ensures the avoiding of bank conflicts in the shared 
memory access. For calcuating the convolution, two buffers of sizes NgL and NfU must be created in shared memory. This 
figure reproduces the portion of the input data higlighted in gray in Fig. [T] 
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FIG. 3 Double precision speedup for the GPU version of the fundamental operations on the wavefunctions as a function of the 
single wavefunction size. 
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FIG. 4 Double precision speedup for the GPU version of the three-dimensional operators used in the BigDFT code as a function 
of the single wavefunction size. 



V. TOWARD A COMPLETE HYBRID CODE 

In this section, we will discuss about the main issues for an efficient execution of hybrid code. The memory transfers 
between memories of CPU and GPU using the PCI-Express bus are well known to be potentially a bottleneck, and 
consequently a major obstacle to good performance. So, we have to adapt the algorithm to reduce the memory 
transfers as most as possible. In our case, this can be done thanks to the scheduling of operations in the BigDFT 
code. Indeed, an optimized iteration of a single wavefunction is organized as follows: 

I) Density construction, 

II) Poisson solver Local potential (V// + T^c), 

III) Local hamiltonian, 

IV) Non-Local hamiltonian, 
V) Wavefunction residue, 

VI) Residue preconditioning, 
VII) Wavefunction update, 

VIII) Orthogonalization (Cholesky factorization). 

The wavefunctions do not evolve between steps I) and III). So, they can be transferred on the GPU (global memory) 
before computing the density. After the step III), they can be sent back to the host memory (CPU), thus saving 
two memory transfers. Morever, in this way, computation time is saved since the operations needed for the density 
calculation coincide with the first part of the operations for the local hamiltonian, see Sec. |II.E[ Given the above 
scheduling of operations, the full CPU-CPU implementation of BigDFT code can be efficiently implemented. In the 
current hybrid implementation, we can execute on the GPU the steps I), III) and VI) and also all BLAS routines 
performed in steps V) (DGEMM) and VIII) (DSYRK). However, all other operations, such as LAPACK routines 
(step VIII) or the multiplication with the nonlocal pseudopotentials (step IV) are still executed on the CPU and can 
be ported on GPU. We left these implementation to future versions of the hybrid code. 



A. Parallel distribution 

Hybrid architectures are becoming more and more popular with configurations of several CPU and GPU. Typically, 
a configuration may be composed of two quad-cores processors (Intel Xeon or Nehalem) and two NVidia GPU. So, 
in this case, the two GPU have to be shared between the eight CPU cores. The problem of data distribution and 
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load balancing is a major issue to achieve optimal performance. The operators implemented on GPU are parallelized 



within the orbital distribution scheme (see Sec. II.G). This means that each core may host a subset of the orbitals, 
and apply the operators of Sec. |IV.A| only to these wavefunctions. 

A possible solution to the GPU sharing is to dedicate statically one GPU to one CPU core. So, in the common 
configuration, two CPU cores are more powerful because they have access to GPU. The six other CPU cores do 
not interact with the GPU. Since the number of orbitals which may be assigned to each core can be adjusted, a 
possible way to handle the inhomogeneity CPU/GPU would be to assign more orbitals to the cores which have a 
GPU associated. Such kind of approach can be realised thanks to the flexibility of the data distribution scheme of 
BigDFT. However, it may be difficult for the end user to define an optimal repartition of the orbitals between the 
different cores for a generic system. 

For this reason, we have designed an alternative approach where the GPU arc completely shared by all CPU cores. 
The major consequence is that the set of orbitals is equally distributed to CPU cores. Essentially, each of the CPU 
cores of a given node is allowed to use one of the GPU cards which are associated to the node, so that each card 
dialogues with a group of CPU cores. Each GPU is associated to two semaphores, which control the memory transfers 
and the calculations. In this way the memory transfers of the data between the different cores and the card can 
be executed asynchronously and overlapped with the calculations. This is optimal since each orbital is processed 
independently. The technical details of this implementation will be described elsewhere, via a more technical paper. 
In the next section, we will study the performance of the hybrid BigDFT code. 



VI. PERFORMANCE EVALUATION OF HYBRID CODE 

The three-dimensional operator described in previous sections can be used in the full DFT code with the implemen- 
tation of the memory transfers described in Sec|V] As a test system, we used the ZnO crystal, which has a wurtzite 
bulk-like structure. Such system has a relatively high density of valence electrons so that the number of orbitals is 
rather large even for a moderate number of atoms. Our calculations are performed on the hybrid part of the CINES 
IBLIS machine, which has 12 nodes, connected with an Infiniband 4X DDR connectX network, each node (2 Xeon 
X5472 quadri-core) connected with 2 GPU of a Tesla S1070 card. 

We performed two kinds of tests. In the first one, to control the behaviour of the code for systems of increasing 
size, we performed a set of calculations for different supercells with increasing number of processors, such that the 
number of orbital per MPI process is kept constant. We performed a comparison for the same runs in which all the 
CPU cores have a GPU associated. In the second test we kept fixed the size of the system and increased the number 
of MPI process such as to decrease the number of orbitals per core. We then controlled the speedup of each run by 
varying the number of cores associated to a GPU. This could be done thanks to the data transfer overlap which allows 
for multiple cores to share the same card, as described in Sec. |V.A[ 



A. Homogeneous CPU/GPU repartition 

For the first test, results are shown in Fig. [5] Our comparison shows an overall speedup of the whole code by a 
factor of around 6. This results is interesting and sounds very promising for a number of reasons. First of all, as 
already discussed, not all the routines of the code were ported on GPU. We focus our efforts to the operators which 
can be written via a convolution. Also the application of the non-local part of the hamiltonian can be performed 
on the GPU, and we are planning to do this in further developments. Moreover, the actual implementation of the 
GPU convolutions can be further optimised. For example, in the preconditioner the wavelet transformation can be 
combined with the kinetic operator (as it is done in the CPU version) in order to define a faster GPU kernel. This 
whill allow to save more time, since the preconditioner still represents a considerable fraction of the overall time also 
in the GPU version, see Figj6] Also the linear algebra operations can be further optimised. For the moment, only the 
calls to the BLAS routines were accelerated on the GPU, via suitable calls to the corresponding CUBLAS routines. 
Also the LAPACK routines, which are needed to perform the orthogonalisation process, can be ported on GPU, with 
a considerable gain. Indeed the linear algebra operations represent the most expensive part of the code for very large 
systems (see In figure 7] the speedup of the different CPU-related parts of the runs of Figjsjis shown. As it can 
be seen, for large systems the overall speedup is dominated by the linear algebra operations. An optimisation of this 
section is then crucial for future improvements of the hybrid code. 
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FIG. 5 Relative speedup of the hybrid DFT code wrt the equivalent pure CPU run, as a function of the number of orbitals. The 
calculation is performed in parallel such that each processor holds the same number of orbitals (36 in this figure). The number 
of atoms of each system is eight times the number of cores considered. Also the time in seconds for a single minimisation 
iteration is indicated, showing a speedup of a factor of around 6 with the hybrid CPU-GPU architecture, in double precision 
computation. 



B. Inhomogeneous CPU/GPU repartition 

The second test mainly concerns the performances of the code in the case in which several MPI processes (and thus 
CPU cores) are associated to the same card. We perform different runs for different repartitions of the card per core, 
such as to check the speedup by varying the ratio GPU/CPU on a hybrid run. Results are plotted in Figjsj We can 
see that the slowing down in the speedup with respect to the ratio GPU/CPU is less abrupt than it can be expected. 
In particular, results are not so much altered for a ratio of 50% (1 GPU per 2 CPU cores) and only slow down of 
a factor of around two with a ratio of 25%, i.e. four cores per card. This is particularly encouraging since at the 



moment only the convolutions operators are de-syncronised by the semaphores (see Sec V.Al, and the BLAS routines 
are executed at the same time on the card. Future improvements in this direction may allow us to better optimise 
the load on the cards such as to further increase the efficiency. 



VII. CONCLUSIONS AND PERSPECTIVES 

The port of the principal sections of a electronic structure code over graphic processing units (GPU) has been shown. 
Such GPU sections have been inserted in the complete code, in order to have a production DFT code which is able 
to run in a multi-CPU environment. The DFT code we used, named BigDFT, is based on Daubechies wavelets, and 
has high systematic convergence properties, very good performances, and excellent efficiency on parallel computation. 
The GPU implementation of the code we propose fully respects these properties. We use double precision calculations, 
and we may achieve considerable speedup for the converted routines (up to a factor of 20 for some operations). Our 
developments are fully compatible with the existing parallellisation of the code, and the communication between CPU 
and GPU does not affect the efficiency of the existing implementation. The data transfers between the CPU and the 
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FIG. 7 Speedup of the GPU-related sections of the code for the systems of Fig[5] 
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FIG. 8 Speedup of the full DFT code as a function of the number of cpu cores (i.e. MPI processes) for inhomogeneous 
repartitions of the number of GPU cards per core. The speedup is about 3.5 for a 1/4 repartition (four MPI processes per 
card), while for a 1/2 repartition it tends to be of the same order of the homogeneous case (see Fig[7]for results on different 
systems). 

GPU can be optimised in such a way to allow that more than one CPU core is associated to the same card. This is 
optimal for modern hybrid supercomputer architectures in which the number of GPU cards is generally smaller than 
the number of CPU cores. We test our implementation by running systems of variable number of atoms on a 12-node 
hybrid machine, with two GPU cards per node. These developements produce an overall speedup on the whole code 
of a factor of around 6, also for a fully parallel run. It should be stressed that, for these runs, our code has no hot-spot 
operations, and that all the sections which are ported on GPU contribute to the overall speedup. Moreover, given the 
nature of the parallelisation of the BigDFT code, we expect these results to hold also on bigger systems in massive 
parallel hybrid environments, such as for example, the machine which is going to installed in France, at the CCRT, 
in the near future. 

The GPU port of the routine was performed for fully periodic boundary conditions(BC), but also different BC can 
be implemented without altering the nature of the developments. This is particularly interesting since it means that 
all these developments are totally compatible with the addition of new functionalities, like for example a linear scaling 
approach of the BigDFT code, which is under preparation. 

Such results can however be further improved by optimising the present GPU routines, and by accelerating other 
sections of the code. The hybrid BigDFT code, like its pure CPU counterpart, is available under GNU-GPL license 
and can be downloaded form the site in Ref.(8). Up to our knowledge, it is the first time that a systematic electronic 
structure code has been able to run on hybrid (super)computers in (massively) parallel environement. In summary, our 
results open a path toward the use of GPU for double precision density functional theory calculations and encourage 
us to proceed further in these direction. 
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